Author
Name Claire Descombes
Affiliation Universitätsklinik für Neurochirurgie, Inselspital Bern
Degree MSc Statistics and Data Science, University of Bern
Contact claire.descombes@insel.ch

The reference material for this course, as well as some useful literature to deepen your knowledge of R, can be found at the bottom of the page.

1 Basics of coding in R

R is free, open-source software for statistical computing and graphics. RStudio is a user-friendly environment for R, designed to make it more accessible. R can technically be used without RStudio (although I wouldn’t advise it), but the reverse is not possible. To download both, follow the links below:

✏️ Download both programs.

Once you have downloaded both programs and opened RStudio, you will be presented with a window similar to the one shown in the following figure.

  • Left pane: contains the Console, Terminal and Background Jobs tabs.
  • Top right pane: contains the Environment, History, Connections and Tutorial tabs.
  • Bottom right pane: contains the Files, Plots, Packages, Help, Viewer and Presentation tabs.

1.1 Console

In the Console tab, we first see information about the version of R we are using and some basic commands to try out. At the end of these descriptions, we can type our R code, press Enter and see the result below the code line.

2+2
## [1] 4
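
The console can be used like a calculator. Here is a minimal sketch of the most common arithmetic operators; the numbers are arbitrary examples.

```r
# Basic arithmetic operators
7 + 3    # addition: 10
7 - 3    # subtraction: 4
7 * 3    # multiplication: 21
7 / 3    # division: 2.333333
7 ^ 2    # exponentiation: 49
7 %/% 3  # integer division: 2
7 %% 3   # remainder (modulo): 1

# Parentheses control the order of operations
2 + 3 * 4    # 14
(2 + 3) * 4  # 20
```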

1.2 Help

The help() function and ? help operator in R provide access to the documentation pages for R functions, data sets, and other objects, both for packages in the standard R distribution and for contributed packages.

You can access help directly from the console or via the Help tab in the bottom right-hand corner.

help(c)
# or equivalently
? c
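
When you do not remember the exact name of a function, a few other tools can help. The sketch below only uses functions that ship with R; the search terms are arbitrary examples.

```r
# List all objects whose name contains "median"
apropos("median")

# Show the arguments of a function without opening its help page
args(sd)

# Run the examples given at the bottom of a help page
example(mean)

# Search all installed help pages for a keyword
# (uncomment to try it; it opens a list of matching pages)
# help.search("standard deviation")
```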

1.3 Script file

However, code that we run directly in the console isn’t saved, so it cannot be reproduced later. If we need to write reproducible code to solve a specific task (and we usually do), we have to write it in a script file, and save it regularly, rather than typing it into the console.

To start recording a script, click File – New File – R Script. This will open a text editor in the top-left corner of the RStudio interface (above the Console tab, see following figure).

✏️ Create your own script. Feel free to take notes directly in it. You’ll use this script as a working document to complete various small tasks and exercises.

☑️ All exercises have an example solution at the end of the chapter.

💡 When your code starts to get long or complex, consider breaking it down into separate scripts with clear and specific purposes — for example: 1_data_import.R, 2_data_cleaning.R, 3_survival_analysis.R, 4_qol_analysis.R, etc.

1.4 Comments

Comments can be added to the code in a script using the hash symbol #.

# Here is a comment.

It is very important to comment every piece of your code, for two reasons:

  • You will still be able to understand what you have written after a few months or years.
  • Sharing becomes easier: without comments, it will take someone else much longer to understand your code.

So, for scientific purposes, please comment your code!
Here’s an example of how I usually comment the scripts I use in my daily work:

################################
# TOSCAN 2.0: Matching algorithm
################################

# 1) Script info-header --------------------------------------------------------

# Project:                  TOSCAN 2.0
# Author:                   Claire Descombes
# Contact:                  claire.descombes@insel.ch
# Date last modification:   07/05/2025
# Purpose:                  Match TOSCAN cohort to Swiss population using BFS data
# Environment:              R version 4.4.2
#                           RStudio 2024.09.1+394 "Cranberry Hibiscus" Release

# 2) Packages & environment ----------------------------------------------------

library(duckdb)      # Interface to DuckDB, an in-process SQL OLAP database engine for fast queries on large datasets
library(dplyr)       # Grammar of data manipulation for data frames (select, filter, mutate, etc.)
library(lubridate)   # Makes working with dates and times easier (e.g., extracting year, month, parsing dates)
library(arrow)       # Provides access to Apache Arrow tools, including reading/writing Parquet files
library(readxl)      # Imports Excel files (.xls and .xlsx) into R
library(progress)    # Adds progress bars to loops and long operations in the console
library(glue)        # Facilitates string interpolation, especially useful for building SQL queries dynamically

options(scipen = 999)  # Prevents scientific notation (e.g., 1e+05) when printing numbers

# 3) Data import ---------------------------------------------------------------

# Set working directory to source file location
setwd("C:/Users/I0343303/Documents/Forschung/TOSCAN2.0")

# Etc.

💡 Use ---- after numbered headers in comments to make your code more navigable and readable in long scripts (this is a common R style convention).

1.5 Objects, data types, variables

In R, everything is an object. This means that every piece of data you work with, from a single number to a complex dataset, is represented as an object with specific properties and behaviours. An object has attributes like class (data type) and dimensions.

Variables act as labels for objects. They are essentially pointers to the actual object stored in memory and appear in the Environment tab in RStudio.

Here’s an example to clarify the difference between variables and objects.

# We create an object (here: a vector) named 'vec' and assign a sequence of numbers to it.
vec <- 1:10 

# 'vec' is the variable. The sequence of numbers (1, 2, 3, ..., 10) is the object.

💡 To assign an object to a variable, use the <- operator; = also works, but <- is the more common convention in R.
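
The following sketch illustrates the idea of variables as labels. It relies on R's copy-on-modify behaviour: changing an object through one variable does not change what another variable refers to. The names a, b, x and y are arbitrary.

```r
a <- c(1, 2, 3)  # 'a' refers to a numeric vector
b <- a           # 'b' now refers to the same values

b[1] <- 10       # modifying 'b' creates a modified copy ...
a                # ... so 'a' is unchanged: 1 2 3
b                # 10 2 3

# Both assignment operators give the same result here
x <- 5
y = 5
identical(x, y)  # TRUE

# List the variables currently defined (as shown in the Environment tab)
ls()
```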

1.5.1 Inspect an object

Before diving into data types and structures, it’s helpful to know how to inspect objects in R. Several built-in functions can help you understand the structure and content of an object.

Let’s define a simple data frame (more details about data frames below) to demonstrate the purpose of those functions.

# Example data frame
df <- data.frame(
  ID = 1:5,
  Name = c("Anna", "Ben", "Carla", "David", "Eva"),
  Age = c(23, 31, 29, 40, 35)
)

Now let’s inspect this object using a few useful functions.

# General object inspection
typeof(df)          # Returns the internal storage type of the object
## [1] "list"
str(df)             # Gives a compact, human-readable summary of the object's structure
## 'data.frame':    5 obs. of  3 variables:
##  $ ID  : int  1 2 3 4 5
##  $ Name: chr  "Anna" "Ben" "Carla" "David" ...
##  $ Age : num  23 31 29 40 35
attributes(df)      # Lists the object's attributes (e.g., names, dimensions, class)
## $names
## [1] "ID"   "Name" "Age" 
## 
## $class
## [1] "data.frame"
## 
## $row.names
## [1] 1 2 3 4 5
str(attributes(df)) # Displays the structure of the attributes
## List of 3
##  $ names    : chr [1:3] "ID" "Name" "Age"
##  $ class    : chr "data.frame"
##  $ row.names: int [1:5] 1 2 3 4 5
class(df)           # Returns the class of the object (e.g., data.frame)
## [1] "data.frame"
# Functions especially useful for matrices or data frames
nrow(df)          # Number of rows
## [1] 5
ncol(df)          # Number of columns
## [1] 3
dim(df)           # Dimensions (rows, columns)
## [1] 5 3
colnames(df)      # Column names
## [1] "ID"   "Name" "Age"
rownames(df)      # Row names
## [1] "1" "2" "3" "4" "5"
str(colnames(df)) # Structure of the column names (e.g., character vector)
##  chr [1:3] "ID" "Name" "Age"
str(rownames(df)) # Structure of the row names (e.g., character vector)
##  chr [1:5] "1" "2" "3" "4" "5"

💡 The function str() provides a compact view of the internal structure of an R object, helping you understand its components and data types quickly.

1.5.2 Data types

In R, data types define the kind of information a variable can hold. Here are some of the most common data types:

1.5.2.1 Numeric: Represents real numbers (e.g., 3.14, -2.5, 0).

typeof(3.14)
## [1] "double"
str(3.14)
##  num 3.14

1.5.2.2 Integer: Represents whole numbers (e.g., 2L, -5L). The “L” suffix indicates an integer.

typeof(2L) 
## [1] "integer"
str(2L)
##  int 2
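
Note that a whole number typed without the L suffix is stored as a double, not as an integer. A short sketch of the difference, and of the conversion between the two types:

```r
typeof(2)          # "double": no L suffix
typeof(2L)         # "integer"
is.integer(2)      # FALSE
is.numeric(2L)     # TRUE: integers also count as numeric

2L == 2            # TRUE: the values are equal ...
identical(2L, 2)   # FALSE: ... but the storage types differ

as.integer(2.9)    # 2: conversion truncates towards zero, it does not round
round(2.9)         # 3
```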

1.5.2.3 Logical: Represents Boolean values (TRUE or FALSE).

typeof(TRUE)
## [1] "logical"
str(TRUE)
##  logi TRUE
# Example of a logical operation
values <- 1:10
above_five <- (values > 5)
above_five
##  [1] FALSE FALSE FALSE FALSE FALSE  TRUE  TRUE  TRUE  TRUE  TRUE
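
Logical vectors are often used for counting and filtering, because TRUE counts as 1 and FALSE as 0 in arithmetic. A minimal sketch, reusing the vector from the example above:

```r
values <- 1:10
above_five <- values > 5

sum(above_five)      # number of TRUE values: 5
mean(above_five)     # proportion of TRUE values: 0.5
which(above_five)    # positions of the TRUE values: 6 7 8 9 10
values[above_five]   # keep only the elements that meet the condition: 6 7 8 9 10
any(values > 9)      # is at least one value above 9? TRUE
all(values > 0)      # are all values positive? TRUE
```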

✏️ Exercise on Booleans: Given a vector of ages (ages <- c(35, 45, 60, 15, 50, 8)), determine which patients are eligible for a treatment (age above 18). Return a Boolean vector indicating whether each patient meets the age criteria.

1.5.2.4 Character: Represents text or strings (e.g., “hello”, “world”).

typeof("hello")
## [1] "character"
str("hello")
##  chr "hello"
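
A few base R functions cover the most common operations on character values. A minimal sketch; the variable name greeting is an arbitrary example.

```r
greeting <- "hello"

nchar(greeting)              # number of characters: 5
toupper(greeting)            # "HELLO"
paste(greeting, "world")     # "hello world" (joined with a space)
paste0(greeting, "_", 1:2)   # "hello_1" "hello_2" (no separator)

# Numbers stored as text must be converted before doing arithmetic
as.numeric("10") + 1         # 11
```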

1.5.2.5 Complex: Represents complex numbers (e.g., 1 + 2i).

typeof(1 + 2i) 
## [1] "complex"
str(1 + 2i)
##  cplx 1+2i

1.5.2.6 NAs/NaNs

In some cases the components of a vector may not be known. When an element or value is “not available” or a “missing value” in the statistical sense, a place within a vector may be reserved for it by assigning it the special value NA. In general any operation on an NA becomes an NA.

z <- c(1:3,NA)
print(z)
## [1]  1  2  3 NA
is.na(z)
## [1] FALSE FALSE FALSE  TRUE

There is a second kind of “missing” value, which is produced by numerical computation: the so-called Not a Number (NaN) value.

0/0
## [1] NaN
Inf - Inf
## [1] NaN
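
The sketch below shows how missing values propagate through computations and how to test for them, using the vector z from above.

```r
z <- c(1:3, NA)

mean(z)                 # NA: the missing value propagates
mean(z, na.rm = TRUE)   # 2: NA is dropped before computing the mean
sum(is.na(z))           # number of missing values: 1

NA > 1                  # NA: the result of the comparison is unknown
NA == NA                # NA, which is why is.na() is needed to test for NA

# NaN counts as missing for is.na(), but NA is not NaN
is.na(NaN)              # TRUE
is.nan(NA)              # FALSE
```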

1.5.3 Objects

Objects are the entities that R operates on. These can be:

1.5.3.1 Vectors

  • The most fundamental data structure.
  • A one-dimensional array of elements of the same data type (e.g., numeric, character, logical).
  • Created using the c() function.
vec1 <- c(1,2,3)
str(vec1)
##  num [1:3] 1 2 3
# Alternative ways of creating vectors:
vec2 <- 1:3  # Sequence of integers
vec3 <- seq(1, 3, by=1)  # More general sequence

Vector elements can be accessed using [] brackets.

# Accessing elements of the vector by index (R uses 1-based indexing)
vec1[1]  # First element
## [1] 1
vec2[c(2, 3)]  # Elements at indices 2 and 3
## [1] 2 3
vec3[c(-2)]  # All elements except for the element at index 2
## [1] 1 3
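
Because all elements of a vector must share one data type, R silently converts mixed input to the most flexible type involved (in the order logical, integer, double, character). A minimal sketch:

```r
c(1L, 2.5)         # integer + double -> double: 1.0 2.5
c(1, TRUE, FALSE)  # logical -> numeric: 1 1 0
c(1, "a", TRUE)    # anything mixed with text -> character: "1" "a" "TRUE"

typeof(c(1, "a", TRUE))  # "character"

# Arithmetic on vectors is applied element by element
v <- c(2, 4, 6)
v * 10     # 20 40 60
length(v)  # 3
```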

1.5.3.2 Matrices

  • Two-dimensional arrays of elements of the same data type.
  • Can be created using the matrix() function.
(mat <- matrix(c(1,2,3,4), nrow = 2, ncol = 2))
##      [,1] [,2]
## [1,]    1    3
## [2,]    2    4
str(mat)
##  num [1:2, 1:2] 1 2 3 4

💡 By enclosing the assignment in parentheses (), you not only create the object but also automatically print its value to the console — a useful shortcut. This is equivalent to writing print(object) or simply typing the object name (e.g., object), but it saves you an extra line of code.
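
By default, matrix() fills the values column by column, as the output of mat shows. The sketch below fills a matrix row by row instead and accesses its elements with [row, column] indexing; the name m is an arbitrary example.

```r
# Fill by row instead of by column (the default)
m <- matrix(1:6, nrow = 2, ncol = 3, byrow = TRUE)
m
# Row 1 contains 1 2 3, row 2 contains 4 5 6

m[1, 2]   # element in row 1, column 2: 2
m[2, ]    # entire second row: 4 5 6
m[, 3]    # entire third column: 3 6
dim(m)    # 2 3
t(m)      # transpose: a 3 x 2 matrix
```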

1.5.3.3 Arrays

  • Generalization of matrices to more than two dimensions.
# The variable is named 'arr' to avoid reusing the name of the function array()
(arr <- array(1:8, c(2,4,2)))
## , , 1
## 
##      [,1] [,2] [,3] [,4]
## [1,]    1    3    5    7
## [2,]    2    4    6    8
## 
## , , 2
## 
##      [,1] [,2] [,3] [,4]
## [1,]    1    3    5    7
## [2,]    2    4    6    8
str(arr)
##  int [1:2, 1:4, 1:2] 1 2 3 4 5 6 7 8 1 2 ...

1.5.3.4 Lists

  • Ordered collections of objects, which may be of different types.
  • Lists can contain other lists as elements.
# The variable is named 'my_list' to avoid reusing the name of the function list()
(my_list <- list(numb = 10:15, char = 'hello'))
## $numb
## [1] 10 11 12 13 14 15
## 
## $char
## [1] "hello"
str(my_list)
## List of 2
##  $ numb: int [1:6] 10 11 12 13 14 15
##  $ char: chr "hello"
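
List elements can be accessed by name or by position. The sketch below shows the difference between $, [[ ]] and [ ]; the name my_list is an arbitrary example.

```r
my_list <- list(numb = 10:15, char = "hello")

my_list$numb        # by name with $: 10 11 12 13 14 15
my_list[["char"]]   # by name with [[ ]]: "hello"
my_list[[1]][2]     # second value of the first element: 11

# Single brackets return a (shorter) list, not the element itself
class(my_list["numb"])    # "list"
class(my_list[["numb"]])  # "integer"

# Add a new element
my_list$flag <- TRUE
names(my_list)      # "numb" "char" "flag"
```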

1.5.3.5 Factors

  • Categorical variables.
  • Represent data with a limited number of possible values.
(fac <- factor(c("single", "married")))
## [1] single  married
## Levels: married single
str(fac)
##  Factor w/ 2 levels "married","single": 2 1
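
By default, factor levels are sorted alphabetically, which is why “married” comes before “single” above. The levels argument sets the order explicitly. A minimal sketch with made-up values:

```r
status <- c("single", "married", "single", "widowed")

# Choose the order of the levels explicitly
fac2 <- factor(status, levels = c("single", "married", "widowed"))

levels(fac2)      # "single" "married" "widowed"
nlevels(fac2)     # 3
as.integer(fac2)  # internal integer codes: 1 2 1 3
table(fac2)       # counts per level: single 2, married 1, widowed 1
```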

1.5.3.6 Data Frames

  • Two-dimensional tabular data structure.
  • Can contain columns of different data types, such as numeric, character, logical, or factor.
  • The most common data structure for representing datasets in R.
  • Technically, a data frame is a special type of list where each element (i.e., column) is of the same length and can be a vector (numeric, character, logical), factor, matrix, list, or even another data frame.
(d <- data.frame(id = 1:5, 
                 val = c(4,5,2,6,5),
                 group = sample(c("exp","control"), size = 5, replace = TRUE)))
str(d)
## 'data.frame':    5 obs. of  3 variables:
##  $ id   : int  1 2 3 4 5
##  $ val  : num  4 5 2 6 5
##  $ group: chr  "control" "exp" "exp" "control" ...
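
The statement that a data frame is a special type of list can be checked directly. The sketch below uses a small made-up data frame d2 and also shows how a new column can be derived from an existing one.

```r
d2 <- data.frame(id = 1:3, val = c(4, 5, 2))  # small example data frame

is.list(d2)       # TRUE: a data frame is a list of columns
length(d2)        # number of list elements = number of columns: 2
d2$val            # one column as a vector: 4 5 2
d2[["val"]]       # same result with list-style access

# Add a new column computed from an existing one
d2$val_doubled <- d2$val * 2
d2
```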

✏️ Create a data frame with 10 rows and the columns id, blood_pressure, and group:

  • id: integers from 1 to 10
  • blood_pressure: random values from a normal distribution with mean 123 and standard deviation 8
  • group: a factor with levels “drug 1”, “drug 2”, and “obs arm” (you decide how to assign them, e.g. by using the function sample())

After creating the data frame, add a new column stand_score containing a standardized score for each blood_pressure value. The standardized score is similar to a z-score but is calculated from the mean and standard deviation of the blood_pressure values in the dataset: standardized score = (x − μ)/σ.

💡 Use the function rnorm() to simulate normal values. Use the function scale() to centre and scale a vector, or alternatively the functions mean() and sd() to compute mean and standard deviation of a vector. You can use help() to learn more about how these functions work.

1.5.3.7 Functions

  • Reusable blocks of code that perform a specific task.
  • Also considered objects in R.
frac <- function(numerator, denominator) {
  result <- numerator / denominator
  return(result)
}

frac(6, 2)  # Calling the function
## [1] 3
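
Function arguments can have default values, and a function can stop with an informative error when its input makes no sense. The sketch below is a hypothetical variant of frac(), called frac2() here; the check for a zero denominator is an illustrative choice, not a requirement.

```r
# 'denominator' gets a default value, used when the argument is omitted
frac2 <- function(numerator, denominator = 1) {
  # Stop with an informative message instead of returning Inf or NaN
  if (denominator == 0) {
    stop("denominator must not be zero")
  }
  numerator / denominator  # the last evaluated expression is returned
}

frac2(6, 2)                            # 3
frac2(6)                               # 6 (default denominator = 1)
frac2(denominator = 4, numerator = 2)  # 0.5 (named arguments, any order)
```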

✏️ Write a function sum_squared that takes two integers and returns the sum of their squared values.

# Example
sum_squared(2,3)
# The output should be:
13

2 Dealing with data sets

2.1 Working directory

When you import a file stored on your computer (e.g. a dataset), you can either

  • keep it in your R working directory (see below), in which case you don’t need to specify the full path to the file when you import it,
  • or keep it in any other folder and always specify the full path.

The advantage of putting the files in the folder that contains your script and is set as the working directory is that you can easily move the folder around on your computer without getting any problems with your script: just set the working directory to your source file every time you open it, and you’ll be fine.

# Example
setwd("~/path/to/your/folder/")
data <- read.csv("testdata.csv")

The advantage of always giving the full path to a file is that you can get data in different folders on your computer, avoiding things like copying the source data in every folder where you have a corresponding script.

# Example
data <- read.csv("~/path/to/your/folder/testdata.csv")

To find out what your current working directory is, you can use the function getwd().

getwd()
## [1] "/home/claire/Documents/GitHub/rforphysicians/docs"

Working directory

To tell R which folder you are working in (e.g., where your data is stored), you have several options:

  • Go to Session > Set Working Directory > Choose Directory and select your folder manually.
  • Use setwd("path/to/your/folder") in your script.
  • Or, the most convenient for script-based work: go to Session > Set Working Directory > To Source File Location to automatically set the working directory to the location of your script.

💡 I recommend placing both your script and your data files in the same folder, and setting that folder as your working directory. This helps avoid errors caused by R not finding your data.

getwd()                       # Displays the current working directory
setwd("path/to/your/folder")  # Sets the working directory
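
A few base R helpers make working with paths less error-prone. The sketch below assumes a hypothetical subfolder data containing a file testdata.csv; neither needs to exist for the code to run.

```r
# Build a path from its parts; "/" is used as the separator
csv_path <- file.path("data", "testdata.csv")
csv_path   # "data/testdata.csv"

# Check whether a file exists before trying to import it
file.exists(csv_path)

# List the CSV files in the current working directory
list.files(pattern = "\\.csv$")
```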

2.2 Importing data

We will first look at how to import a CSV file into R as a data frame.

CSV stands for Comma-Separated Values. In a .csv file, the values are stored as plain text, separated by commas. This is a simple and widely used format for storing tabular data.

After setting your working directory or determining the path to your CSV file, you can use the read.csv() function to import the data. This will create a data frame, which is one of the most commonly used structures in R for handling datasets.

# Import a CSV file into a data frame
dataset <- read.csv("~/path/to/your/folder/data.csv")
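
To try read.csv() without an existing data file, the sketch below first writes a small made-up CSV file to a temporary location and then reads it back. It also shows read.csv2(), the base R variant for files that use a semicolon as separator and a comma as decimal mark. All file contents are invented for illustration.

```r
# Write a small example file to a temporary location
tmp <- tempfile(fileext = ".csv")
writeLines(c("id,age,sex",
             "1,34,female",
             "2,51,male",
             "3,,female"),   # the empty field is a missing value
           tmp)

dat <- read.csv(tmp)
str(dat)          # 3 obs. of 3 variables
is.na(dat$age)    # FALSE FALSE TRUE: the empty age is read as NA

# Semicolon-separated file with decimal commas
tmp2 <- tempfile(fileext = ".csv")
writeLines(c("id;weight", "1;70,5", "2;82,3"), tmp2)
dat2 <- read.csv2(tmp2)
dat2$weight       # 70.5 82.3
```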

💡 I recommend using data frames — they are generally easier to work with than matrices, especially for beginners.

Another widely used data format is the Excel file (.xlsx). For these, you can use the readxl package to import the data:

# Load the readxl package
library(readxl)

# Read the first sheet of an Excel file
dataset <- read_excel("~/path/to/your/folder/data.xlsx")

⚠️ Note: If your file is actually a CSV but mistakenly has a .xlsx extension, you should rename it to .csv and use read.csv() instead. Mixing up formats can lead to import errors.

2.3 Handling data frames

Let us now look at real data frames to learn how to call or modify their elements. To do this, we will use multiple health data sets from the National Health and Nutrition Examination Survey (NHANES) from 2011-2012. The survey assessed overall health and nutrition of adults and children in the United States and was conducted by the National Center for Health Statistics (NCHS). The data sets can be found in the data_sets folder.

Dataset               | NHANES Code | Description                                           | CSV File
Demographics          | DEMO_G      | Age, sex, race/ethnicity, income, education           | DEMO_G.csv
Blood Pressure        | BPX_G       | Systolic/diastolic blood pressure, number of readings | BPX_G.csv
Body Measures         | BMX_G       | Height, weight, BMI, waist circumference              | BMX_G.csv
Smoking Questionnaire | SMQ_G       | Smoking habits, exposure to secondhand smoke          | SMQ_G.csv

# Load the necessary CSV files into data frames
demo <- read.csv("/home/claire/Documents/GitHub/rforphysicians/data_sets/DEMO_G.csv") # Demographics (cycle G = 2011–2012)
bpx  <- read.csv("/home/claire/Documents/GitHub/rforphysicians/data_sets/BPX_G.csv") # Blood pressure
bmx  <- read.csv("/home/claire/Documents/GitHub/rforphysicians/data_sets/BMX_G.csv") # Body measures
smq  <- read.csv("/home/claire/Documents/GitHub/rforphysicians/data_sets/SMQ_G.csv") # Smoking questionnaire

# Check the structure of the data frames
str(demo)
## 'data.frame':    9756 obs. of  48 variables:
##  $ SEQN    : int  62161 62162 62163 62164 62165 62166 62167 62168 62169 62170 ...
##  $ SDDSRVYR: chr  "NHANES 2011-2012 public release" "NHANES 2011-2012 public release" "NHANES 2011-2012 public release" "NHANES 2011-2012 public release" ...
##  $ RIDSTATR: chr  "Both interviewed and MEC examined" "Both interviewed and MEC examined" "Both interviewed and MEC examined" "Both interviewed and MEC examined" ...
##  $ RIAGENDR: chr  "Male" "Female" "Male" "Female" ...
##  $ RIDAGEYR: int  22 3 14 44 14 9 0 6 21 15 ...
##  $ RIDAGEMN: int  NA NA NA NA NA NA 11 NA NA NA ...
##  $ RIDRETH1: chr  "Non-Hispanic White" "Mexican American" "Other Race - Including Multi-Racial" "Non-Hispanic White" ...
##  $ RIDRETH3: chr  "Non-Hispanic White" "Mexican American" "Non-Hispanic Asian" "Non-Hispanic White" ...
##  $ RIDEXMON: chr  "May 1 through October 31" "November 1 through April 30" "May 1 through October 31" "November 1 through April 30" ...
##  $ RIDEXAGY: int  NA 3 14 NA 14 10 NA 6 NA 15 ...
##  $ RIDEXAGM: int  NA 41 177 NA 179 120 12 81 NA 181 ...
##  $ DMQMILIZ: chr  "No" NA NA "Yes" ...
##  $ DMQADFC : chr  NA NA NA "No" ...
##  $ DMDBORN4: chr  "Born in 50 US states or Washington, DC" "Born in 50 US states or Washington, DC" "Born in 50 US states or Washington, DC" "Born in 50 US states or Washington, DC" ...
##  $ DMDCITZN: chr  "Citizen by birth or naturalization" "Citizen by birth or naturalization" "Citizen by birth or naturalization" "Citizen by birth or naturalization" ...
##  $ DMDYRSUS: chr  NA NA NA NA ...
##  $ DMDEDUC3: chr  NA NA "8th grade" NA ...
##  $ DMDEDUC2: chr  "High school graduate/GED or equivalent" NA NA "Some college or AA degree" ...
##  $ DMDMARTL: chr  "Never married" NA NA "Married" ...
##  $ RIDEXPRG: chr  NA NA NA "The participant was not pregnant at exam" ...
##  $ SIALANG : chr  "English" "English" "English" "English" ...
##  $ SIAPROXY: chr  "Yes" "Yes" "Yes" "No" ...
##  $ SIAINTRP: chr  "No" "No" "No" "No" ...
##  $ FIALANG : chr  "English" "English" "English" "English" ...
##  $ FIAPROXY: chr  "No" "No" "No" "No" ...
##  $ FIAINTRP: chr  "No" "No" "No" "No" ...
##  $ MIALANG : chr  "English" NA "English" NA ...
##  $ MIAPROXY: chr  "No" NA "No" NA ...
##  $ MIAINTRP: chr  "No" NA "No" NA ...
##  $ AIALANGA: chr  "English" NA "English" NA ...
##  $ WTINT2YR: num  102641 15458 7398 127351 12210 ...
##  $ WTMEC2YR: num  104237 16116 7869 127965 13384 ...
##  $ SDMVPSU : int  1 3 3 1 2 1 2 2 1 3 ...
##  $ SDMVSTRA: int  91 92 90 94 90 91 92 103 92 91 ...
##  $ INDHHIN2: chr  "$75,000 to $99,999" "$15,000 to $19,999" "$100,000 and Over" "$45,000 to $54,999" ...
##  $ INDFMIN2: chr  "$75,000 to $99,999" "$15,000 to $19,999" "$100,000 and Over" "$45,000 to $54,999" ...
##  $ INDFMPIR: num  3.15 0.6 4.07 1.67 0.57 NA NA 3.48 0.33 5 ...
##  $ DMDHHSIZ: int  5 6 5 5 5 6 7 5 5 4 ...
##  $ DMDFMSIZ: int  5 6 5 5 5 6 4 5 5 4 ...
##  $ DMDHHSZA: int  0 2 0 1 1 0 3 0 0 0 ...
##  $ DMDHHSZB: int  1 2 2 2 2 4 3 2 1 2 ...
##  $ DMDHHSZE: int  0 0 1 0 0 0 1 1 0 0 ...
##  $ DMDHRGND: chr  "Female" "Female" "Male" "Male" ...
##  $ DMDHRAGE: int  50 24 42 52 33 44 61 43 51 38 ...
##  $ DMDHRBR4: chr  "Born in 50 US states or Washington, DC" "Born in 50 US states or Washington, DC" "Born in 50 US states or Washington, DC" "Born in 50 US states or Washington, DC" ...
##  $ DMDHREDU: chr  "College Graduate or above" "High School Grad/GED or Equivalent" "College Graduate or above" "Some College or AA degree" ...
##  $ DMDHRMAR: chr  "Married" "Living with partner" "Married" "Married" ...
##  $ DMDHSEDU: chr  "College Graduate or above" NA "Some College or AA degree" "Some College or AA degree" ...
str(bpx)
## 'data.frame':    9338 obs. of  27 variables:
##  $ SEQN    : int  62161 62162 62163 62164 62165 62166 62167 62168 62169 62170 ...
##  $ PEASCST1: chr  "Complete" "Complete" "Complete" "Complete" ...
##  $ PEASCTM1: int  596 64 788 527 468 583 55 98 1005 625 ...
##  $ PEASCCT1: chr  NA NA NA NA ...
##  $ BPXCHR  : int  NA 100 NA NA NA NA 100 96 NA NA ...
##  $ BPQ150A : chr  "No" NA "Yes" "Yes" ...
##  $ BPQ150B : chr  "No" NA "No" "No" ...
##  $ BPQ150C : chr  "No" NA "No" "No" ...
##  $ BPQ150D : chr  "No" NA "No" "No" ...
##  $ BPAARM  : chr  "Right" NA "Right" "Right" ...
##  $ BPACSZ  : chr  "Large (15X32)" NA "Adult (12X22)" "Adult (12X22)" ...
##  $ BPXPLS  : int  82 NA 72 82 70 90 NA NA 72 62 ...
##  $ BPXPULS : chr  "Regular" "Regular" "Regular" "Regular" ...
##  $ BPXPTY  : chr  "Radial" NA "Radial" "Radial" ...
##  $ BPXML1  : int  130 NA 140 140 130 120 NA NA 140 140 ...
##  $ BPXSY1  : int  110 NA 112 116 110 96 NA NA 124 124 ...
##  $ BPXDI1  : int  82 NA 38 56 64 32 NA NA 80 82 ...
##  $ BPAEN1  : chr  "No" NA "No" "No" ...
##  $ BPXSY2  : int  104 NA 108 118 104 94 NA NA 126 122 ...
##  $ BPXDI2  : int  68 NA 36 66 72 40 NA NA 74 84 ...
##  $ BPAEN2  : chr  "No" NA "No" "No" ...
##  $ BPXSY3  : int  118 NA 106 120 106 94 NA NA 124 128 ...
##  $ BPXDI3  : int  74 NA 38 58 78 0 NA NA 80 82 ...
##  $ BPAEN3  : chr  "No" NA "No" "No" ...
##  $ BPXSY4  : int  NA NA NA NA NA NA NA NA NA NA ...
##  $ BPXDI4  : int  NA NA NA NA NA NA NA NA NA NA ...
##  $ BPAEN4  : chr  NA NA NA NA ...
str(bmx) 
## 'data.frame':    9338 obs. of  26 variables:
##  $ SEQN    : int  62161 62162 62163 62164 62165 62166 62167 62168 62169 62170 ...
##  $ BMDSTATS: int  1 1 1 1 1 1 1 1 1 1 ...
##  $ BMXWT   : num  69.2 12.7 49.4 67.2 69.1 28.8 10.8 23.6 54.6 63.5 ...
##  $ BMIWT   : int  NA NA NA NA NA NA NA NA NA NA ...
##  $ BMXRECUM: num  NA 95.7 NA NA NA NA 79.5 NA NA NA ...
##  $ BMIRECUM: int  NA NA NA NA NA NA NA NA NA NA ...
##  $ BMXHEAD : num  NA NA NA NA NA NA NA NA NA NA ...
##  $ BMIHEAD : logi  NA NA NA NA NA NA ...
##  $ BMXHT   : num  172.3 94.7 168.9 170.1 159.4 ...
##  $ BMIHT   : int  NA NA NA NA NA NA NA NA NA NA ...
##  $ BMXBMI  : num  23.3 14.2 17.3 23.2 27.2 16.2 NA 15.4 20.1 18.2 ...
##  $ BMDBMIC : int  NA 2 2 NA 3 2 NA 2 NA 2 ...
##  $ BMXLEG  : num  40.2 NA 40.3 40.5 42.1 31 NA NA 38.7 43.3 ...
##  $ BMILEG  : int  NA NA NA NA NA NA NA NA NA NA ...
##  $ BMXARML : num  35 18.5 36.3 37.2 35.2 28 16.2 24.8 33.4 37.5 ...
##  $ BMIARML : int  NA NA NA NA NA NA NA NA NA NA ...
##  $ BMXARMC : num  32.5 16.6 22 29.3 29.7 19.1 15.5 17.1 28.5 25.8 ...
##  $ BMIARMC : int  NA NA NA NA NA NA NA NA NA NA ...
##  $ BMXWAIST: num  81 45.4 64.6 80.1 86.7 59.8 NA 54.4 69.6 69.4 ...
##  $ BMIWAIST: int  NA NA NA NA NA NA NA NA NA NA ...
##  $ BMXSAD1 : num  17.7 NA 15.6 18.3 21 13.5 NA NA 16.4 14.8 ...
##  $ BMXSAD2 : num  17.9 NA 15.5 18.5 20.8 13.5 NA NA 16.3 14.7 ...
##  $ BMXSAD3 : num  NA NA NA NA NA NA NA NA NA NA ...
##  $ BMXSAD4 : num  NA NA NA NA NA NA NA NA NA NA ...
##  $ BMDAVSAD: num  17.8 NA 15.6 18.4 20.9 13.5 NA NA 16.4 14.8 ...
##  $ BMDSADCM: int  NA NA NA NA NA NA NA NA NA NA ...
str(smq) 
## 'data.frame':    6790 obs. of  30 variables:
##  $ SEQN    : int  62161 62163 62164 62165 62169 62170 62171 62172 62174 62176 ...
##  $ SMQ020  : chr  "No" NA "No" NA ...
##  $ SMD030  : int  NA NA NA NA NA NA NA 28 NA NA ...
##  $ SMQ040  : chr  NA NA NA NA ...
##  $ SMQ050Q : int  NA NA NA NA NA NA NA NA NA NA ...
##  $ SMQ050U : chr  NA NA NA NA ...
##  $ SMD055  : int  NA NA NA NA NA NA NA NA NA NA ...
##  $ SMD057  : int  NA NA NA NA NA NA NA NA NA NA ...
##  $ SMQ077  : chr  NA NA NA NA ...
##  $ SMD641  : int  NA NA NA NA NA NA NA 30 NA NA ...
##  $ SMD650  : int  NA NA NA NA NA NA NA 10 NA NA ...
##  $ SMD093  : chr  NA NA NA NA ...
##  $ SMDUPCA : chr  "" "" "" "" ...
##  $ SMD100BR: chr  "" "" "" "" ...
##  $ SMD100FL: chr  NA NA NA NA ...
##  $ SMD100MN: chr  NA NA NA NA ...
##  $ SMD100LN: chr  NA NA NA NA ...
##  $ SMD100TR: int  NA NA NA NA NA NA NA 6 NA NA ...
##  $ SMD100NI: num  NA NA NA NA NA NA NA 0.6 NA NA ...
##  $ SMD100CO: int  NA NA NA NA NA NA NA 6 NA NA ...
##  $ SMQ621  : chr  NA "I have never smoked, not even a puff" NA "I have never smoked, not even a puff" ...
##  $ SMD630  : int  NA NA NA NA NA NA NA NA NA NA ...
##  $ SMQ660  : chr  NA NA NA NA ...
##  $ SMQ664M : chr  NA NA NA NA ...
##  $ SMQ664C : chr  NA NA NA NA ...
##  $ SMQ664W : chr  NA NA NA NA ...
##  $ SMQ664B : logi  NA NA NA NA NA NA ...
##  $ SMQ664O : chr  NA NA NA NA ...
##  $ SMQ670  : chr  NA NA NA NA ...
##  $ SMAQUEX2: chr  "Home Interview (20+ Yrs)" "A-CASI (12 - 19 Yrs)" "Home Interview (20+ Yrs)" "A-CASI (12 - 19 Yrs)" ...

✏️ Exercise on the NHANES data sets n°1: import the demo, bpx, bmx and smq data sets from the data_sets folder into R.

💡 The codebook for each dataset can be accessed either on the NCHS website or directly in R using the function nhanesCodebook(nh_table, colname) from the package nhanesA (which I used to download the data). You’ll find more details about installing packages at the end of this chapter.

2.3.1 Accessing elements in data frames

Being able to access elements in a data frame is essential when working with data. Here are some common methods to select specific elements, rows, or columns.

# Look at the first and the last few rows
head(demo)
tail(demo)
# Select columns by name
demo[, c("RIDAGEYR", "RIAGENDR")]  # Selecting age in years and gender
vars <- c("RIDAGEYR", "RIAGENDR")
demo[, vars]  # Alternative using variable `vars`
# Select elements by position
demo[1, 1]  # Access the first element of the first column (the respondent sequence number of the 1st participant)
## [1] 62161
ind_mat <- cbind(c(1, 3, 5), c(2, 4, 6))
demo[ind_mat]  # Access single cells: each row of ind_mat is a (row, column) pair, here (1,2), (3,4) and (5,6)
## [1] "NHANES 2011-2012 public release" "Male"                           
## [3] NA
# Select rows based on a condition
head(demo[, "RIDAGEYR"] > 50)  # Logical condition for age greater than 50
## [1] FALSE FALSE FALSE FALSE FALSE FALSE
head(!(demo[, "DMDHHSIZ"] > 3))  # Logical negation for total number of people in the household not greater than 3
## [1] FALSE FALSE FALSE FALSE FALSE FALSE
demo[demo[, "RIDAGEYR"] > 50, ]  # Rows where age > 50
demo[demo[, "DMDHHSIZ"] < 3, ]  # Rows where the total number of people in the household is less than 3
demo[demo[, "DMDHHSIZ"] >= 3, ]  # Rows where the total number of people in the household is greater than or equal to 3
# Combine logical vectors using "&" (AND), "|" (OR), and "!" (NOT)
demo[(demo[, "RIDAGEYR"] > 50 & demo[, "RIAGENDR"] == "Female"), ]  # Both conditions must be true
demo[(demo[, "DMDHHSIZ"] < 3 | demo[, "RIAGENDR"] == "Male"), ]  # At least one condition must be true
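
One pitfall of selecting rows with a logical condition is that missing values in the condition produce rows filled with NA. The sketch below uses a small made-up data frame toy to show the problem and two common ways around it.

```r
toy <- data.frame(id = 1:4, age = c(62, NA, 45, 71))

# The NA in 'age' makes the condition NA for row 2,
# which yields an extra row filled with NA (3 rows in total)
toy[toy$age > 50, ]

# which() keeps only the rows where the condition is TRUE (2 rows)
toy[which(toy$age > 50), ]

# subset() also drops the NA rows and lets you use column names directly
subset(toy, age > 50)
```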

💡 To inspect one column, you can also use the dollar $ symbol to access a column as a vector.

head(demo$RIDAGEYR)  # Returns the age column as a vector
## [1] 22  3 14 44 14  9

💡 You can use the brackets [] to select specific rows and columns. Since data frames are two-dimensional, the first index refers to rows and the second to columns. To select a particular column, you can omit the row index. To select a particular row, omit the column index.

head(demo[, "RIDAGEYR"])  # All rows in the age column
## [1] 22  3 14 44 14  9
demo[1, ]  # Row 1 (all columns)

✏️ Exercise on the NHANES data sets n°2: inspect the structure of the demo data set, look at different entries, get familiar with those commands.

2.3.2 Basic descriptive statistics

R makes it simple to compute basic descriptive statistics for exploring your dataset. Below are a few useful examples.

2.3.2.1 Central tendency: mean and median

mean(demo$RIDAGEYR, na.rm = TRUE)                 # Average (mean) age of participants
## [1] 31.40262
median(demo$DMDHHSIZ, na.rm = TRUE)               # Median household size
## [1] 4

💡 The na.rm argument in those functions allows for ignoring the NA values.

2.3.2.2 Dispersion: standard deviation, min, max, and range

sd(demo$RIDAGEYR, na.rm = TRUE)                     # Standard deviation of age
## [1] 24.57899
range(demo$DMDHHSZA, na.rm = TRUE)                  # Range of number of young children
## [1] 0 3
min(demo$DMDHRAGE, na.rm = TRUE)                    # Minimum age of household reference person
## [1] 18
max(demo$DMDHRAGE, na.rm = TRUE)                    # Maximum age of household reference person
## [1] 80
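
Other base R functions for dispersion follow the same pattern, including the na.rm argument. A minimal sketch on a made-up vector ages:

```r
ages <- c(22, 3, 14, 44, 14, 9, NA)  # made-up values with one NA

var(ages, na.rm = TRUE)                           # variance (the squared standard deviation)
quantile(ages, c(0.25, 0.5, 0.75), na.rm = TRUE)  # quartiles
IQR(ages, na.rm = TRUE)                           # interquartile range
diff(range(ages, na.rm = TRUE))                   # max minus min: 41
```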

2.3.2.3 Frequency tables and proportions

table(demo$RIAGENDR)                                # Gender distribution
## 
## Female   Male 
##   4900   4856
table(demo$DMDHRMAR)                                # Marital status of household reference person
## 
##            Divorced          Don't Know Living with partner             Married 
##                 816                   8                 876                5282 
##       Never married             Refused           Separated             Widowed 
##                1539                  82                 436                 581
table(demo$AIALANGA)                                # Language of interview
## 
## Asian languages         English         Spanish 
##              95            5192             467
prop.table(table(demo$AIALANGA))                    # Proportional distribution of interview languages
## 
## Asian languages         English         Spanish 
##      0.01651025      0.90232881      0.08116093
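
Note that table() drops NA values by default, so the proportions above refer to the 5754 participants with a recorded interview language, not to all 9756. The sketch below, on made-up data, shows how to include missing values in the counts and how to cross-tabulate two variables.

```r
lang <- c("English", "Spanish", NA, "English", "English", NA)  # made-up data

table(lang)                    # NA values are dropped: English 3, Spanish 1
table(lang, useNA = "ifany")   # adds a count for NA: English 3, Spanish 1, NA 2

# Percentages rounded to one decimal
round(100 * prop.table(table(lang)), 1)   # English 75, Spanish 25

# Cross-tabulation of two variables
sex <- c("F", "M", "F", "F", "M", "M")
table(sex, lang, useNA = "ifany")
```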

2.3.2.4 Group-wise summaries

aggregate(DMDHRAGE ~ DMDHRMAR, data = demo, FUN = mean, na.rm = TRUE)   # Mean age of household reference person by marital status
aggregate(DMDHHSIZ ~ DMDHRGND, data = demo, FUN = median, na.rm = TRUE) # Median household size by gender of reference person
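
To see what such a group-wise summary returns, here is a minimal sketch on a small made-up data frame toy. With the formula interface, aggregate() drops rows with missing values by default; tapply() is shown as an alternative from base R.

```r
toy <- data.frame(
  group = c("A", "A", "B", "B", "B"),
  age   = c(30, 40, 50, NA, 70)
)

# Mean age per group; the row with the missing age is dropped
res <- aggregate(age ~ group, data = toy, FUN = mean)
res
##   group age
## 1     A  35
## 2     B  60

# tapply() returns the same summaries as a named vector
tapply(toy$age, toy$group, mean, na.rm = TRUE)
```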

2.3.2.5 Full overview

summary(demo)
##       SEQN         SDDSRVYR           RIDSTATR           RIAGENDR        
##  Min.   :62161   Length:9756        Length:9756        Length:9756       
##  1st Qu.:64600   Class :character   Class :character   Class :character  
##  Median :67038   Mode  :character   Mode  :character   Mode  :character  
##  Mean   :67038                                                           
##  3rd Qu.:69477                                                           
##  Max.   :71916                                                           
##                                                                          
##     RIDAGEYR       RIDAGEMN       RIDRETH1           RIDRETH3        
##  Min.   : 0.0   Min.   : 0.00   Length:9756        Length:9756       
##  1st Qu.: 9.0   1st Qu.: 4.00   Class :character   Class :character  
##  Median :26.0   Median : 9.00   Mode  :character   Mode  :character  
##  Mean   :31.4   Mean   :10.03                                        
##  3rd Qu.:52.0   3rd Qu.:16.00                                        
##  Max.   :80.0   Max.   :24.00                                        
##                 NA's   :9130                                         
##    RIDEXMON            RIDEXAGY         RIDEXAGM       DMQMILIZ        
##  Length:9756        Min.   : 2.000   Min.   :  0.0   Length:9756       
##  Class :character   1st Qu.: 5.000   1st Qu.: 42.0   Class :character  
##  Mode  :character   Median : 9.000   Median : 99.0   Mode  :character  
##                     Mean   : 9.641   Mean   :104.2                     
##                     3rd Qu.:14.000   3rd Qu.:160.0                     
##                     Max.   :20.000   Max.   :239.0                     
##                     NA's   :6338     NA's   :5747                      
##    DMQADFC            DMDBORN4           DMDCITZN           DMDYRSUS        
##  Length:9756        Length:9756        Length:9756        Length:9756       
##  Class :character   Class :character   Class :character   Class :character  
##  Mode  :character   Mode  :character   Mode  :character   Mode  :character  
##                                                                             
##                                                                             
##                                                                             
##                                                                             
##    DMDEDUC3           DMDEDUC2           DMDMARTL           RIDEXPRG        
##  Length:9756        Length:9756        Length:9756        Length:9756       
##  Class :character   Class :character   Class :character   Class :character  
##  Mode  :character   Mode  :character   Mode  :character   Mode  :character  
##                                                                             
##                                                                             
##                                                                             
##                                                                             
##    SIALANG            SIAPROXY           SIAINTRP           FIALANG         
##  Length:9756        Length:9756        Length:9756        Length:9756       
##  Class :character   Class :character   Class :character   Class :character  
##  Mode  :character   Mode  :character   Mode  :character   Mode  :character  
##                                                                             
##                                                                             
##                                                                             
##                                                                             
##    FIAPROXY           FIAINTRP           MIALANG            MIAPROXY        
##  Length:9756        Length:9756        Length:9756        Length:9756       
##  Class :character   Class :character   Class :character   Class :character  
##  Mode  :character   Mode  :character   Mode  :character   Mode  :character  
##                                                                             
##                                                                             
##                                                                             
##                                                                             
##    MIAINTRP           AIALANGA            WTINT2YR         WTMEC2YR     
##  Length:9756        Length:9756        Min.   :  3321   Min.   :     0  
##  Class :character   Class :character   1st Qu.: 11352   1st Qu.: 11174  
##  Mode  :character   Mode  :character   Median : 18098   Median : 18090  
##                                        Mean   : 31426   Mean   : 31426  
##                                        3rd Qu.: 34887   3rd Qu.: 34792  
##                                        Max.   :220233   Max.   :222580  
##                                                                         
##     SDMVPSU         SDMVSTRA        INDHHIN2           INDFMIN2        
##  Min.   :1.000   Min.   : 90.00   Length:9756        Length:9756       
##  1st Qu.:1.000   1st Qu.: 92.00   Class :character   Class :character  
##  Median :2.000   Median : 96.00   Mode  :character   Mode  :character  
##  Mean   :1.643   Mean   : 95.87                                        
##  3rd Qu.:2.000   3rd Qu.: 99.00                                        
##  Max.   :3.000   Max.   :103.00                                        
##                                                                        
##     INDFMPIR        DMDHHSIZ        DMDFMSIZ        DMDHHSZA    
##  Min.   :0.000   Min.   :1.000   Min.   :1.000   Min.   :0.000  
##  1st Qu.:0.860   1st Qu.:2.000   1st Qu.:2.000   1st Qu.:0.000  
##  Median :1.630   Median :4.000   Median :4.000   Median :0.000  
##  Mean   :2.206   Mean   :3.761   Mean   :3.591   Mean   :0.531  
##  3rd Qu.:3.580   3rd Qu.:5.000   3rd Qu.:5.000   3rd Qu.:1.000  
##  Max.   :5.000   Max.   :7.000   Max.   :7.000   Max.   :3.000  
##  NA's   :840                                                    
##     DMDHHSZB         DMDHHSZE       DMDHRGND            DMDHRAGE    
##  Min.   :0.0000   Min.   :0.000   Length:9756        Min.   :18.00  
##  1st Qu.:0.0000   1st Qu.:0.000   Class :character   1st Qu.:33.00  
##  Median :1.0000   Median :0.000   Mode  :character   Median :43.00  
##  Mean   :0.9318   Mean   :0.395                      Mean   :45.39  
##  3rd Qu.:2.0000   3rd Qu.:1.000                      3rd Qu.:56.00  
##  Max.   :4.0000   Max.   :3.000                      Max.   :80.00  
##                                                                     
##    DMDHRBR4           DMDHREDU           DMDHRMAR           DMDHSEDU        
##  Length:9756        Length:9756        Length:9756        Length:9756       
##  Class :character   Class :character   Class :character   Class :character  
##  Mode  :character   Mode  :character   Mode  :character   Mode  :character  
##                                                                             
##                                                                             
##                                                                             
## 

2.3.3 Modifying data frames

Now suppose we want to modify, add or remove entries, rows or columns in our data frame. This is where bracket indexing really comes in handy. In this setting, I recommend working on a copy of the data frame rather than modifying the original one.

Some examples follow.

# Modify one entry:
demo_mod <- demo  # Create a copy to avoid modifying the original data set
demo_mod[1, 1:5]
demo_mod[1, "RIAGENDR"] <- 'Female' # Change gender of the first participant
demo_mod[1, 1:5]
# Modify multiple entries based on a condition:
demo_mod[1:10, 1:5]
demo_mod$RIDAGEYR[!is.na(demo_mod$RIDAGEYR) & demo_mod$RIDAGEYR < 18] <- 18  # Set all ages below 18 to 18
demo_mod[1:10, 1:5]
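
The examples above modify existing entries. Adding and removing columns or rows follows the same indexing logic; the sketch below uses a small made-up data frame (`toy` is hypothetical) so that the effect is easy to see.

```r
# Hypothetical toy data frame (not NHANES)
toy <- data.frame(id = 1:4, age = c(12, 25, 47, 63))

# Add a column derived from an existing one
toy$adult <- toy$age >= 18

# Remove a column by assigning NULL to it
toy$id <- NULL

# Remove rows with negative indexing (here: drop the first row)
toy_small <- toy[-1, ]

# Keep only the rows that satisfy a condition
toy_adults <- toy[toy$adult, ]

print(toy_small)
print(toy_adults)
```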

✏️ Exercise on the NHANES data sets n°3: generate a new data frame containing only the female participants who are 18 years or older and who took the interview in Spanish.

2.3.4 Combining data frames

In practice, data is often spread across multiple data frames that need to be combined. Depending on the structure and goal, there are different ways to combine data frames:

2.3.4.1 Column binding with cbind()

Use cbind() to add columns side-by-side. The data frames must have the same number of rows.

# Extract one column from demo as a vector with one element per row of demo
extra_info <- demo$RIDRETH1

# Combine using cbind
combined_df <- cbind(demo, extra_info)
combined_df[1, ]

2.3.4.2 Row binding with rbind()

Use rbind() to stack data frames vertically. The data frames must have the same column names and types.

# Extract one row from demo to create an additional data frame with the same structure (column names and types)
(new_participant <- demo[18,])
# Combine using rbind
extended_df <- rbind(demo, new_participant)
extended_df[nrow(extended_df),]

2.3.4.3 Merging with merge() (Join)

Use merge() to combine data frames based on a common column, similar to SQL joins (see figure below for a reminder on the different types of joins).

# Merge by participant ID `SEQN` (inner join by default)
merged_df <- merge(demo, bmx, by = "SEQN")
head(merged_df)

You can also specify the type of join:

  • Left join: keep all rows from demo: left_join_df <- merge(demo, bmx, by = "SEQN", all.x = TRUE)
  • Right join: keep all rows from bmx: right_join_df <- merge(demo, bmx, by = "SEQN", all.y = TRUE)
  • Full outer join: keep all rows from both data frames: full_join_df <- merge(demo, bmx, by = "SEQN", all = TRUE)
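
To make the difference between the join types concrete, here is a minimal sketch with two made-up data frames (`left` and `right` are hypothetical; only the key name `SEQN` is borrowed from NHANES). The two tables share the IDs 2 and 3 only.

```r
# Hypothetical toy data frames sharing the key column SEQN
left  <- data.frame(SEQN = c(1, 2, 3), age = c(34, 51, 8))
right <- data.frame(SEQN = c(2, 3, 4), weight = c(80.5, 27.0, 65.2))

inner_df <- merge(left, right, by = "SEQN")                # keeps SEQN 2, 3
left_df  <- merge(left, right, by = "SEQN", all.x = TRUE)  # keeps SEQN 1, 2, 3
right_df <- merge(left, right, by = "SEQN", all.y = TRUE)  # keeps SEQN 2, 3, 4
full_df  <- merge(left, right, by = "SEQN", all = TRUE)    # keeps SEQN 1, 2, 3, 4

# Unmatched rows are kept with NA in the columns coming from the other table
print(full_df)
```

Checking nrow() before and after a merge is a quick way to confirm that the join kept the rows you expected.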

2.3.4.4 Dealing with NAs

Handling missing data (NAs) is a common task in data analysis. Before deciding how to treat them, it’s important to understand where and how often they occur.

colSums(is.na(demo))              # Number of NAs per column
##     SEQN SDDSRVYR RIDSTATR RIAGENDR RIDAGEYR RIDAGEMN RIDRETH1 RIDRETH3 
##        0        0        0        0        0     9130        0        0 
## RIDEXMON RIDEXAGY RIDEXAGM DMQMILIZ  DMQADFC DMDBORN4 DMDCITZN DMDYRSUS 
##      418     6338     5747     3749     9205        0        5     7683 
## DMDEDUC3 DMDEDUC2 DMDMARTL RIDEXPRG  SIALANG SIAPROXY SIAINTRP  FIALANG 
##     7157     4196     4196     8548        0        6        0      105 
## FIAPROXY FIAINTRP  MIALANG MIAPROXY MIAINTRP AIALANGA WTINT2YR WTMEC2YR 
##      105      105     3043     3043     3043     4002        0        0 
##  SDMVPSU SDMVSTRA INDHHIN2 INDFMIN2 INDFMPIR DMDHHSIZ DMDFMSIZ DMDHHSZA 
##        0        0       81       51      840        0        0        0 
## DMDHHSZB DMDHHSZE DMDHRGND DMDHRAGE DMDHRBR4 DMDHREDU DMDHRMAR DMDHSEDU 
##        0        0        0        0      365      362      136     4881
sum(complete.cases(demo))         # Number of rows without any NAs
## [1] 0

💡 is.na() returns a logical matrix where TRUE indicates a missing value (NA) and FALSE indicates a non-missing value. colSums() takes this logical matrix and sums up the TRUE values (which are treated as 1), giving you the count of missing values for each column.

💡 complete.cases() returns a logical vector: TRUE if a row has no missing values, and FALSE otherwise. Using sum(complete.cases(...)) counts the number of rows with no missing data.

One way to handle missing data is to remove rows containing NAs for the variable(s) you are interested in. This can be appropriate in some cases, but it should be done with care, as it may introduce bias or reduce sample size. We’ll discuss this further in chapter 4.

# Remove rows with a missing value in the DMDHRMAR column
demo_DMDHRMAR <- demo[!is.na(demo$DMDHRMAR), ]

# Check for missing values
sum(is.na(demo_DMDHRMAR$DMDHRMAR))
## [1] 0

You can remove all rows with missing values across any of the columns in the dataset using the function na.omit().

# Remove rows with missing values in any column
demo_no_na <- na.omit(demo)

# Check the resulting data frame and its structure
head(demo_no_na)

💡 For our demo data set, this removes all the rows! Another reminder to be very careful when removing NA values.
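
A middle ground between the two approaches is to require completeness only for the columns needed in a given analysis. The sketch below illustrates this on a made-up data frame (`toy` and `vars` are hypothetical names).

```r
# Hypothetical toy data frame with NAs scattered across columns
toy <- data.frame(
  age    = c(25, NA, 40, 31),
  income = c(NA, 3, 5, 2),
  notes  = c(NA, NA, NA, "ok")
)

# na.omit() on the whole data frame keeps only fully complete rows
n_complete <- nrow(na.omit(toy))
print(n_complete)

# Restrict the completeness check to the columns needed for the analysis
vars   <- c("age", "income")
toy_cc <- toy[complete.cases(toy[, vars]), ]
print(toy_cc)
```

Here the mostly empty `notes` column no longer causes otherwise usable rows to be discarded.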

3 Packages

There is a set of standard (or base) packages which are considered part of the R source code and are automatically available as part of your R installation. Base packages contain the basic functions that allow R to work, and enable standard statistical and graphical functions on datasets.

Packages are collections of R functions, data, and compiled code in a well-defined format, created to add specific functionality. There are 10,000+ user-contributed packages, and the number keeps growing. You can install packages using the install.packages() function.

# To install a package, use the install.packages() function (only needed once)
install.packages("dplyr")

# To use a package you have already installed, load it with the library() function (needed in every new R session)
library(dplyr)
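
In scripts that are run repeatedly, it can be convenient to install a package only when it is missing. The following is one common pattern, shown as a sketch; the variable `pkg` is a hypothetical name, and the base package "stats" is used so that the example runs without installing anything.

```r
# "stats" ships with R, so this sketch runs without installing anything;
# replace it with the package you need, e.g. "dplyr"
pkg <- "stats"

# Install only if the package is not yet available
if (!requireNamespace(pkg, quietly = TRUE)) {
  install.packages(pkg)
}

# character.only = TRUE lets library() take the package name from a variable
library(pkg, character.only = TRUE)

# A single function can also be called without attaching the package,
# using the package::function notation
med <- stats::median(c(1, 5, 9))
print(med)
```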

✏️ Exercise on packages: install the ggplot2 package. We will need it for Chapter 2.

4 Solutions to the exercises

Please note that these are only examples; there are always many ways to solve the same task!

☑️ Create your own script: I don’t think any solution is necessary 😉.

☑️ Exercise on Booleans:

ages <- c(35, 45, 60, 15, 50, 8)
eligible <- ages >= 18
eligible
## [1]  TRUE  TRUE  TRUE FALSE  TRUE FALSE

☑️ Create a data frame and compute standardized scores of blood_pressure:

# Create the data frame
(df <- data.frame(id = 1:10, 
                 blood_pressure = rnorm(10, mean = 123, sd = 8),
                 group = factor(sample(c("drug 1", "drug 2", "obs arm"), 10, replace = TRUE))))
# Compute standardized scores for blood pressure
df$stand_score <- scale(df$blood_pressure)

# Alternatively, compute by hand (standardized score = (x - μ) / σ)
df$stand_score_2 <- (df$blood_pressure - mean(df$blood_pressure))/sd(df$blood_pressure)

# View the data frame with standardized scores
df
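
Because rnorm() and sample() draw random numbers, the values in this data frame change on every run. The sketch below, which uses an arbitrary seed chosen only for illustration, shows how set.seed() makes the draws reproducible and how to check that the standardization worked.

```r
set.seed(42)  # arbitrary seed, chosen for illustration only
bp <- rnorm(10, mean = 123, sd = 8)

# scale() returns a one-column matrix; as.numeric() turns it into a plain vector
z <- as.numeric(scale(bp))

# Standardized scores have mean 0 and standard deviation 1 (up to rounding error)
print(all.equal(mean(z), 0))
print(all.equal(sd(z), 1))

# Re-running with the same seed reproduces exactly the same draws
set.seed(42)
bp_again <- rnorm(10, mean = 123, sd = 8)
print(identical(bp, bp_again))
```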

☑️ Write a function sum_squared:

sum_squared <- function(int1, int2){
  sum_of_squares <- int1^2 + int2^2
  return(sum_of_squares)
}

sum_squared(3, 5)
## [1] 34

☑️ Exercise on the NHANES data sets n°1: If you have trouble importing the data sets into R, let me know, I’d be glad to help.

☑️ Exercise on the NHANES data sets n°2: Did you manage to select a specific column you were interested in? Were you able to check how many men and women are included in the data set? Here are a few examples of operations you can use to explore the data set.

# View the first few rows of the dataset
head(demo)
# Display the names of all columns
names(demo)
##  [1] "SEQN"     "SDDSRVYR" "RIDSTATR" "RIAGENDR" "RIDAGEYR" "RIDAGEMN"
##  [7] "RIDRETH1" "RIDRETH3" "RIDEXMON" "RIDEXAGY" "RIDEXAGM" "DMQMILIZ"
## [13] "DMQADFC"  "DMDBORN4" "DMDCITZN" "DMDYRSUS" "DMDEDUC3" "DMDEDUC2"
## [19] "DMDMARTL" "RIDEXPRG" "SIALANG"  "SIAPROXY" "SIAINTRP" "FIALANG" 
## [25] "FIAPROXY" "FIAINTRP" "MIALANG"  "MIAPROXY" "MIAINTRP" "AIALANGA"
## [31] "WTINT2YR" "WTMEC2YR" "SDMVPSU"  "SDMVSTRA" "INDHHIN2" "INDFMIN2"
## [37] "INDFMPIR" "DMDHHSIZ" "DMDFMSIZ" "DMDHHSZA" "DMDHHSZB" "DMDHHSZE"
## [43] "DMDHRGND" "DMDHRAGE" "DMDHRBR4" "DMDHREDU" "DMDHRMAR" "DMDHSEDU"
# View all unique values in the "AIALANGA" (Interview Language) column
unique(demo$AIALANGA)
## [1] "English"         NA                "Spanish"         "Asian languages"
# Count how many men and women are in the dataset
table(demo$RIAGENDR)
## 
## Female   Male 
##   4900   4856
# Get a quick statistical summary of a few columns
summary(demo[,c('RIDAGEYR','DMDHHSZA','DMDHRAGE')])
##     RIDAGEYR       DMDHHSZA        DMDHRAGE    
##  Min.   : 0.0   Min.   :0.000   Min.   :18.00  
##  1st Qu.: 9.0   1st Qu.:0.000   1st Qu.:33.00  
##  Median :26.0   Median :0.000   Median :43.00  
##  Mean   :31.4   Mean   :0.531   Mean   :45.39  
##  3rd Qu.:52.0   3rd Qu.:1.000   3rd Qu.:56.00  
##  Max.   :80.0   Max.   :3.000   Max.   :80.00
# View the number of missing values in each column
colSums(is.na(demo))
##     SEQN SDDSRVYR RIDSTATR RIAGENDR RIDAGEYR RIDAGEMN RIDRETH1 RIDRETH3 
##        0        0        0        0        0     9130        0        0 
## RIDEXMON RIDEXAGY RIDEXAGM DMQMILIZ  DMQADFC DMDBORN4 DMDCITZN DMDYRSUS 
##      418     6338     5747     3749     9205        0        5     7683 
## DMDEDUC3 DMDEDUC2 DMDMARTL RIDEXPRG  SIALANG SIAPROXY SIAINTRP  FIALANG 
##     7157     4196     4196     8548        0        6        0      105 
## FIAPROXY FIAINTRP  MIALANG MIAPROXY MIAINTRP AIALANGA WTINT2YR WTMEC2YR 
##      105      105     3043     3043     3043     4002        0        0 
##  SDMVPSU SDMVSTRA INDHHIN2 INDFMIN2 INDFMPIR DMDHHSIZ DMDFMSIZ DMDHHSZA 
##        0        0       81       51      840        0        0        0 
## DMDHHSZB DMDHHSZE DMDHRGND DMDHRAGE DMDHRBR4 DMDHREDU DMDHRMAR DMDHSEDU 
##        0        0        0        0      365      362      136     4881
# Calculate the average age
mean(demo$RIDAGEYR, na.rm = TRUE)
## [1] 31.40262

☑️ Exercise on the NHANES data sets n°3:

# Filter the dataset to include only female participants who are 18 years or older and who took the interview in Spanish.
demo_filtered <- demo[demo$RIAGENDR == "Female" & 
                      demo$RIDAGEYR > 17 & 
                      !is.na(demo$AIALANGA) & demo$AIALANGA == "Spanish", ]
ind <- c('RIAGENDR', 'RIDAGEYR', 'AIALANGA')
demo_filtered[, ind]

💡 Note: When filtering on a variable that contains missing values (e.g., AIALANGA), you must explicitly exclude NAs using !is.na(...). This is because comparisons like demo$AIALANGA == "Spanish" return NA for missing values, not FALSE, and indexing with NA produces rows filled with NAs instead of dropping them from the subset.
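
The behaviour described in this note can be reproduced on a small made-up data frame (`toy` is hypothetical). The sketch also shows two alternatives to the explicit !is.na() check, which() and %in%, which give the same result here.

```r
# Hypothetical toy data frame with one missing interview language
toy <- data.frame(
  id   = 1:4,
  lang = c("Spanish", NA, "English", "Spanish")
)

# The comparison yields NA (not FALSE) for the missing value
print(toy$lang == "Spanish")

# Indexing with that NA produces an extra row filled with NAs
with_na_row <- toy[toy$lang == "Spanish", ]
print(with_na_row)

# Three ways to keep only the true matches
keep_isna  <- toy[!is.na(toy$lang) & toy$lang == "Spanish", ]  # explicit NA check
keep_which <- toy[which(toy$lang == "Spanish"), ]              # which() ignores NAs
keep_in    <- toy[toy$lang %in% "Spanish", ]                   # %in% never returns NA

print(keep_isna)
```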

5 NHANES data sets

The following variables from the NHANES data sets are used in the examples of this course.

demo data set
RIDAGEYR Participant’s age in years
RIAGENDR Participant’s gender
DMDHHSIZ Total number of people in the household
DMDHHSZA Number of children aged 5 or younger in the household
DMDHRAGE Age of the household reference person
DMDHRMAR Marital status of the household reference person
DMDHRGND Gender of the household reference person
AIALANGA Language of the interview

References

Alexander Henzi. 2021. “Programming and Data Analysis with R.” Lecture notes.
Burns, Patrick. n.d. The R Inferno. Accessed May 8, 2025. https://www.burns-stat.com/documents/books/the-r-inferno/.
“ChatGPT.” n.d. Accessed January 26, 2025. https://chatgpt.com.
“Create Elegant Data Visualisations Using the Grammar of Graphics.” n.d. Accessed January 26, 2025. https://ggplot2.tidyverse.org/.
David, Author. 2016. “BIRT Joins.” MBSE Chaos. https://mbsechaos.wordpress.com/2016/05/24/birt-joins/.
Elena Kosourova. n.d. “RStudio Tutorial for Beginners: A Complete Guide.” Accessed January 26, 2025. https://www.datacamp.com/tutorial/r-studio-tutorial.
Wickham, Hadley, and Garrett Grolemund. n.d. R for Data Science. Accessed May 8, 2025. https://r4ds.had.co.nz/introduction.html.
Mayer, Michael. 2025. “Mayer79/Statistical_computing_material.” https://github.com/mayer79/statistical_computing_material.
Patrick Burns. n.d. Impatient R. Accessed May 8, 2025. https://www.burns-stat.com/documents/tutorials/impatient-r/.
“Synthetic Dataset for AI in Healthcare.” n.d. Accessed May 9, 2025. https://www.kaggle.com/datasets/smmmmmmmmmmmm/synthetic-dataset-for-ai-in-healthcare.
“The Comprehensive R Archive Network.” n.d. Accessed January 26, 2025. https://stat.ethz.ch/CRAN/.
W. N. Venables, D. M. Smith and the R Core Team. n.d. “An Introduction to R.” Accessed May 8, 2025. https://cran.r-project.org/doc/manuals/r-release/R-intro.html.
Wickham, Hadley. n.d. Advanced R. Accessed May 8, 2025. https://adv-r.hadley.nz/introduction.html.